Data Scarcity in Event Analysis and Abusive Language Detection
Lack of data is almost always the cause of the suboptimal performance of neural networks. Although data-scarce scenarios can be simulated for any task by assuming limited access to training data, we study two problem areas where data scarcity is a practical challenge: event analysis and abusive content detection.

Journalists, social scientists, and political scientists need to retrieve and analyze event mentions in unstructured text to compute statistics that help them understand society. We argue that it is hard to specify an information need about events using a keyword-based representation and propose a Query by Example (QBE) setting for event retrieval. In the QBE setting, we assume there are a few example sentences mentioning the event class a user is interested in, and we aim to retrieve relevant events using only those examples as a query. Traditional event detection approaches are not applicable in this setting because event detection datasets are constructed from pre-defined schemas, which limit them to a small set of event and event-argument types. Moreover, the amount of annotated data in event detection datasets is limited, sufficient only to build a retrieval corpus for evaluation. We therefore assume there are no relevance judgments to train an event retrieval model -- except for the few examples of a specific event type. We create three QBE evaluation settings from three event detection datasets: PoliceKilling, ACE, and IndiaPoliceEvents. For the PoliceKilling dataset, where a relevant sentence describes a police killing event, we show that a query model constructed from NLP features extracted from the few given examples is effective compared to event detection baselines. For the ACE dataset, which covers thirty-three event types, we construct a QBE setting for each type and show that a sentence embedding approach transfers effectively for event matching.
Finally, we conduct a unified evaluation across all three datasets using the sentence-embedding-based model and show that it outperforms strong baselines.
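The sentence-embedding-based QBE approach above can be sketched in a few lines. This is a minimal illustration, assuming a toy bag-of-words encoder in place of the learned sentence encoder; the names `embed`, `cosine`, and `qbe_rank` are illustrative, not from the thesis:

```python
import math
from collections import Counter

def embed(sentence):
    # Toy stand-in for a sentence encoder: a bag-of-words count vector.
    # Any encoder mapping text to a fixed vector space slots in here.
    return Counter(sentence.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def qbe_rank(examples, corpus):
    # Pool the few example sentences into a single query vector,
    # then rank corpus sentences by similarity to it -- no relevance
    # judgments or event schema required.
    query = Counter()
    for s in examples:
        query.update(embed(s))
    scored = [(cosine(query, embed(doc)), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True)]

examples = ["police shot and killed a man",
            "officers fatally shot the suspect"]
corpus = ["the quarterly budget was approved",
          "a man was killed when police opened fire",
          "rain is expected over the weekend"]
print(qbe_rank(examples, corpus)[0])
# → a man was killed when police opened fire
```

With a real sentence encoder the same ranking loop applies unchanged; only `embed` and `cosine` would operate on dense vectors.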
We further examine the effect of data scarcity in abusive language detection. We first study a specific type of abusive language -- hate speech. Neural hate speech detection models trained on one dataset generalize poorly to a dataset from a different domain, because the characteristics of hate speech vary with racial and cultural context. Our data scarcity scenario assumes we have a hate speech dataset from one domain that must generalize to a test set from another domain using only unlabeled data from the test domain; we thus assume zero labeled target-domain data. To tackle this scarcity, we propose an unsupervised domain adaptation approach that augments labeled data for hate speech detection. We evaluate the approach with three different models (character CNNs, BiLSTMs, and BERT) on three different collections, and show that it improves area under the precision/recall curve by as much as 42% and recall by as much as 278%, with no loss (and in some cases a significant gain) in precision.
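One common way to realize "augment labeled data using only unlabeled target-domain text" is self-training; the thesis's actual unsupervised adaptation method may differ, so the following is a hedged sketch of the general recipe, using a tiny Naive-Bayes-style scorer (all names hypothetical):

```python
from collections import Counter, defaultdict

def fit(pairs):
    # Per-class unigram counts: a tiny Naive-Bayes-style model standing
    # in for the neural classifiers (CNN/BiLSTM/BERT) in the thesis.
    counts = defaultdict(Counter)
    for text, label in pairs:
        counts[label].update(text.lower().split())
    return counts

def predict_proba(model, text):
    # Add-one-smoothed class scores, normalized into probabilities.
    tokens = text.lower().split()
    scores = {}
    for label, cnt in model.items():
        total = sum(cnt.values()) + len(cnt)
        p = 1.0
        for t in tokens:
            p *= (cnt[t] + 1) / total
        scores[label] = p
    z = sum(scores.values())
    return {lab: s / z for lab, s in scores.items()}

def self_train(source, target_unlabeled, threshold=0.9):
    # Fit on labeled source-domain data, pseudo-label only the
    # confidently-scored unlabeled target-domain texts, add them to the
    # training set, and refit -- zero labeled target data required.
    model = fit(source)
    augmented = list(source)
    for text in target_unlabeled:
        probs = predict_proba(model, text)
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:
            augmented.append((text, label))
    return fit(augmented)
```

The confidence `threshold` controls the precision/recall trade-off of the augmentation: a higher value admits fewer but cleaner pseudo-labels.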
Finally, we examine cross-lingual abusive language detection. Abusive language is a superclass of hate speech that includes profanity, aggression, offensiveness, cyberbullying, toxicity, and hate speech itself. Large collections of abusive language detection data exist in English, such as Jigsaw; for other languages, datasets exist but with very limited data. We propose a cross-lingual transfer learning approach that learns an effective neural abusive language classifier for such low-resource languages with help from a dataset in a resource-rich language. The framework is based on a nearest-neighbor architecture and is thus interpretable by design: it is a modern instantiation of the classic k-nearest-neighbor model, using transformer representations in all of its components. Unlike prior work on neighborhood-based approaches, we encode neighborhood information based on query-neighbor interactions. We propose two encoding schemes and demonstrate their effectiveness with both qualitative and quantitative analyses. Our evaluation on eight languages from two abusive language detection datasets shows sizable improvements in F1 over strong baselines.
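The nearest-neighbor design can be illustrated with a similarity-weighted vote over labeled neighbors from the resource-rich language. The real system uses transformer representations and learned query-neighbor interaction encodings; this pure-Python sketch only approximates that idea, and all names are hypothetical:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors (stand-ins for
    # multilingual transformer embeddings).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(query_vec, bank, k=3):
    # bank: (vector, label) pairs from the labeled resource-rich
    # language. Take the k most similar neighbors and cast a
    # similarity-weighted vote. The retrieved neighbors double as an
    # explanation of the prediction -- the source of the architecture's
    # interpretability-by-design.
    neighbors = sorted(((cosine(query_vec, v), lab) for v, lab in bank),
                       reverse=True)[:k]
    votes = {}
    for sim, lab in neighbors:
        votes[lab] = votes.get(lab, 0.0) + sim
    return max(votes, key=votes.get)
```

A low-resource-language query is embedded into the same multilingual space, so no labeled target-language data is needed for the vote itself.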
Paid Academic Writing Services: A Perceptional Study of Business Students
Detecting which students make use of paid academic writing services -- a practice in which students or authors hire professional writers to produce scholarly work (research papers, university assignments, research reports, and so on) in a predefined style -- is challenging. This study explores the factors that lead students in higher education to choose paid academic writing services (PAWS), a form of contract cheating that affects their performance and personal development, and aims to help them realize that learning matters more than grades, since only through self-exploration can a person achieve something better. Using a quantitative approach, data were gathered from 117 business students enrolled in six higher education institutes in Karachi, Pakistan, through an adopted close-ended questionnaire with 5-point Likert scales measuring students' attitudes towards class assignments, their awareness of plagiarism, and their attitudes towards paid academic writing services. The results revealed that male students were more inclined towards paid writing services than female students, and that an increase in students' attitude towards assignments brought an increase in the use of paid academic writing services. Academic professionals in universities are therefore advised to attend to these two factors to curb the growing use of paid academic writing services.
A Multi-Task Architecture on Relevance-based Neural Query Translation
We describe a multi-task learning approach to train a Neural Machine
Translation (NMT) model with a Relevance-based Auxiliary Task (RAT) for search
query translation. The translation process for Cross-lingual Information
Retrieval (CLIR) is usually treated as a black box and performed as
an independent step. However, an NMT model trained on sentence-level parallel
data is not aware of the vocabulary distribution of the retrieval corpus. We
address this problem with our multi-task learning architecture that achieves
16% improvement over a strong NMT baseline on an Italian-English query-document
dataset. We show using both quantitative and qualitative analysis that our
model generates balanced and precise translations with the regularization
effect it achieves from the multi-task learning paradigm.
Comment: Accepted for publication at ACL 201
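The multi-task setup can be reduced to its joint objective: the shared NMT model is trained on the translation loss plus a weighted relevance-based auxiliary (RAT) loss. The mixing weight `lam` and the function names below are assumptions for illustration, not the paper's exact weighting scheme:

```python
def multitask_loss(nmt_loss, rat_loss, lam=0.5):
    # Joint objective: translation loss plus a weighted relevance-based
    # auxiliary loss. `lam` trades translation fluency off against the
    # retrieval-corpus vocabulary signal; the paper's exact weighting
    # may differ.
    return nmt_loss + lam * rat_loss

def batch_loss(nmt_losses, rat_losses, lam=0.5):
    # Average the joint loss over a mini-batch of paired task losses.
    pairs = list(zip(nmt_losses, rat_losses))
    return sum(multitask_loss(n, r, lam) for n, r in pairs) / len(pairs)
```

Minimizing this combined scalar is what gives the auxiliary task its regularization effect on the translation model.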
Harmonic Scalpel Hemorrhoidectomy Vs Milligan-Morgan Hemorrhoidectomy
Background: To compare Harmonic Scalpel Hemorrhoidectomy (HSH) with classical Milligan-Morgan Hemorrhoidectomy (MMH) in terms of operation time and post-operative pain, to establish the effectiveness of this novel procedure.
Methods: A total of 62 patients planned for excision hemorrhoidectomy were randomly assigned to HSH and MMH groups. Mean operation time was recorded during surgery, and pain at the time of first defecation was recorded on a visual analog scale (VAS).
Results: Mean VAS after surgery at the time of first defecation was 4.32 (SD 0.909) in the HSH group and 6.97 (SD 1.426) in the MMH group (p value <0.000). Mean operation time was 18.13 (SD 3.956) minutes in the HSH group and 22.90 (SD 4.901) minutes in the MMH group (p value <0.000).
Conclusion: Harmonic Scalpel Hemorrhoidectomy is better than Milligan-Morgan Hemorrhoidectomy.
Scalable and Effective Generative Information Retrieval
Recent research has shown that transformer networks can be used as
differentiable search indexes by representing each document as a sequence of
document ID tokens. These generative retrieval models cast the retrieval
problem as a document ID generation problem for each query. Despite their
elegant design, existing generative retrieval models only perform well on
artificially-constructed and small-scale collections. This has led to serious
skepticism in the research community on their real-world impact. This paper
represents an important milestone in generative retrieval research by showing,
for the first time, that generative retrieval models can be trained to perform
effectively on large-scale standard retrieval benchmarks. To do so, we
propose RIPOR, an optimization framework for generative retrieval that can be
adopted by any encoder-decoder architecture. RIPOR is designed based on two
often-overlooked fundamental design considerations in generative retrieval.
First, given the sequential decoding nature of document ID generation,
assigning accurate relevance scores to documents based on the whole document ID
sequence is not sufficient. To address this issue, RIPOR introduces a novel
prefix-oriented ranking optimization algorithm. Second, initial document IDs
should be constructed based on relevance associations between queries and
documents, instead of the syntactic and semantic information in the documents.
RIPOR addresses this issue using a relevance-based document ID construction
approach that quantizes relevance-based representations learned for documents.
Evaluation on MS MARCO and TREC Deep Learning Track reveals that RIPOR
surpasses state-of-the-art generative retrieval models by a large margin
(e.g., 30.5% MRR improvement on MS MARCO Dev Set) and performs on par with
popular dense retrieval models.
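The prefix-oriented ranking idea can be sketched as scoring every prefix of a candidate document ID during decoding, rather than only the completed sequence. Here `step_scores` stands in for hypothetical per-decoding-step model log-probabilities; this is an illustration of the design consideration, not RIPOR's actual optimization algorithm:

```python
def prefix_score(docid, step_scores):
    # docid: tuple of document-ID tokens; step_scores[i]: map from
    # token to a model log-probability at decoding step i. Summing the
    # per-step scores means every prefix of the ID contributes to the
    # final relevance score, not just the completed sequence -- the
    # property sequential (beam-style) decoding depends on.
    total = 0.0
    for i, tok in enumerate(docid):
        total += step_scores[i].get(tok, float("-inf"))
    return total

def rank_docids(docids, step_scores):
    # Rank candidate document IDs by their accumulated prefix scores.
    return sorted(docids, key=lambda d: prefix_score(d, step_scores),
                  reverse=True)
```

If an early token of a relevant document's ID scores poorly, beam search prunes it before the full sequence is ever scored, which is why RIPOR optimizes prefix-level relevance rather than whole-sequence relevance alone.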